Can I convert long binary strings into unique serial numbers of much shorter length?

Question

Petter Stefansson on 6 Sep 2016

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/302066-can-i-convert-long-binary-strings-into-unique-serial-numbers-of-much-shorter-length

Commented: Guillaume on 6 Sep 2016

I’m generating a very large number of long binary strings, for example:

S1 = [0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 …]
S2 = [1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 …]
S3 = …

Sometimes these strings are thousands of elements long and every time I generate a new string I would like to make sure that it is unique and has never been generated before. I could do this by simply storing all strings that has ever existed and using a function like ismember, isequal or something to compare every string to all previous strings, however, this takes a very long time and quickly becomes unfeasible as the list grows.

So I was wondering if there was a faster way of doing this. For example, could I convert every string into a unique serial number of much shorter length somehow so that S1 instead of being represented as a string became the number: 94237 (there is no logic behind this example number) and S 2 some other reasonably short number which perhaps would be much faster to compare uniqueness against than a long string? Is this possible and are there clever conventional ways of converting binary strings into unique serial numbers of minimal length? The serial number does not need to be backwards convertible into the string, all I need is to determine whether or not it is unique.

2 Comments
Show NoneHide None

Stephen23 on 6 Sep 2016

Edited: Stephen23 on 6 Sep 2016

Pick a hash function that suits your data and requirements:

https://en.wikipedia.org/wiki/Hash_function

Note that any hash loses information: hash functions may allow collisions between different data by giving them the same hash value.

Petter Stefansson on 6 Sep 2016

Thank you for the tip. Would you happen to know of any hash function that is particularly well suited for binary data and/or is already build into matlab?

Sign in to comment.

Sign in to answer this question.

Answer 1

Guillaume on 6 Sep 2016

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/302066-can-i-convert-long-binary-strings-into-unique-serial-numbers-of-much-shorter-length#answer_233956

Edited: Guillaume on 6 Sep 2016

Considering that encryptions keys are now getting to 2048 bits or more, 1000 bits is not that long.

How about grouping your bits in lumps of 64. converting that to 64 bits integer, and comparing these 64 bits numbers (with ismember indeed).

If it's not fast enough, then yes any hash function would work. Some, such as md5 are available on the file exchange or can be called directly from matlab using the java interface.

3 Comments
Show 1 older commentHide 1 older comment

Thorsten on 6 Sep 2016

Please have a look at http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

Guillaume on 6 Sep 2016

You're going to need to generate a LOT of numbers or be really unlucky before you see an md5 collision (as long as you don't deliberately try to create collisions).

And if you see a collision, you can simply compare the true binary vectors. 99.999999% you don't have collisions, so you guaranteed the two vectors are not identical with a fast hash comparison, and once in a blue moon you have a slower comparison.

Sign in to comment.

Answer 2

Thorsten on 6 Sep 2016

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/302066-can-i-convert-long-binary-strings-into-unique-serial-numbers-of-much-shorter-length#answer_233945

Edited: Thorsten on 6 Sep 2016

You said that you generate the numbers, but you do not mention any constraints on the generation. Without any constraints, just pick a number and append 1 to ensure that the numbers are unique.

If you do have constraints, it would be good to know them. hash functions are fine, but you cannot be 100% sure that there are no collisions, i.e., that you do not generate the same hash for the different inputs. It is just that the hash function tries so minimise the probability of such collisions.

5 Comments
Show 3 older commentsHide 3 older comments

Petter Stefansson on 6 Sep 2016

If I understand you correctly this way of numbering is just an S followed by a sequence of ones that grows by one element every time I generate a new binary vector? So by the time I have generated a million sequences the identification string is a million elements long? I’m sure I must misunderstand you somehow because that seems unreasonable.

Thorsten on 6 Sep 2016

Edited: Thorsten on 6 Sep 2016

Open in MATLAB Online

That's exactly what I've suggested. You didn't mention that you need millions of sequences. Do you know in advance an upper limit of the number of sequences you need?

Then you could generate these N unique sequences in decimal reprentation using

seq = randperm(N);

and convert to binary

 n = ceil(log2(N))
 binseq = dec2bin(seq, n);

and pick from the precomputed list of unique binary sequences, one after the other.

Sign in to comment.

Can I convert long binary strings into unique serial numbers of much shorter length?

2 Comments
Show NoneHide None

Accepted Answer

3 Comments
Show 1 older commentHide 1 older comment

More Answers (1)

5 Comments
Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Products

Community Treasure Hunt

Can I convert long binary strings into unique serial numbers of much shorter length?

2 Comments Show NoneHide None

Accepted Answer

3 Comments Show 1 older commentHide 1 older comment

More Answers (1)

5 Comments Show 3 older commentsHide 3 older comments

See Also

Categories

Tags

Products

Community Treasure Hunt

2 Comments
Show NoneHide None

3 Comments
Show 1 older commentHide 1 older comment

5 Comments
Show 3 older commentsHide 3 older comments