TXT Processed US 236m records from 263GB Uncompressed data
by andyleung0927 - April 26, 2021 at 01:59 PM
#13
(April 26, 2021 at 04:22 PM)pompompurin Wrote:
(April 26, 2021 at 03:52 PM)STARTEXMISLEAD Wrote: Good idea. I just can't believe there's still no established source yet xd

They must’ve been whitelisting email domains & triple checking that emails exist before allowing people to continue. I am almost certain that this data is from some type of verifications.io//experian//other data on people.

It's a standard consumer data file, like Acuity/Experian/Acxiom. It was then changed by a middleman to include email fields, likely from "leads" files (where someone entered their name/phone/address on a website, and it was sold; most date back to around 2005-2010, with perhaps 75% of the newer ones being fake, slightly re-worked duplicates of the old ones). Then the company that had this file likely removed lots of fields (the real consumer data files often have ~400 fields), added a couple fields (just copying data, like the alphasort fields), changed a few, and converted it to JSON. It's unclear how much work was done by the middleman and how much was done by the end user.

Nobody will likely ever care what company the end user was. The middleman (who added the email addresses) probably won't be well known. The originator is almost certainly Acuity/Experian/Acxiom (it's hard to say for sure which because of the header names being changed).
#14
(April 26, 2021 at 04:22 PM)pompompurin Wrote:
(April 26, 2021 at 03:52 PM)STARTEXMISLEAD Wrote:
(April 26, 2021 at 03:50 PM)pompompurin Wrote:
(April 26, 2021 at 03:48 PM)STARTEXMISLEAD Wrote:
(April 26, 2021 at 03:46 PM)pompompurin Wrote: Quick message to mods: Don’t delete this for repost please, it’s reformatted & I was waiting for somebody to process the data like this, plus he deserves the credits for his work. Thanks 😁

Good point, and I agree: if all the data is still included but it's just parsed better, I would suggest making this one official once the source of the data is identified.
I’m gonna do a full data parse later today & add it to my thread as well, to make it easier for people to use.

Good idea. I just can't believe there's still no established source yet xd

This data is AMAZINGLY neat. I tried the trick where you check for the [email protected], and there were no results. I also tried the trick where you grab all the email domains and sort by the number of occurrences, and there are NO emails with “test” in them, or ANY type of spam/testing domains. I’m going to do the grueling task of identifying the few thousand domains used for emails, and hopefully find some domains that might be the source (assuming the people who owned this system tested it with their company emails). They must’ve been whitelisting email domains & triple checking that emails exist before allowing people to continue. I am almost certain that this data is from some type of verifications.io//experian//other data on people.

I already did everything you wrote, @pompompurin; the only thing I noticed is that it started years ago ;-)
#15
(April 26, 2021 at 01:59 PM)andyleung0927 Wrote:
Hello RaidForums Community,

Thanks to @pompompurin; the original thread is "https://raidforums.com/Thread-CSV-263GB-Leak-250-807-711-Total-records". I selected 1180 CSV files that contain 'real' names of specific persons from 280 CSV files. Then I chose the fields I care about and merged all the files into one TXT file.
Compromised data: firstname,lastname,address,city,state,zip_code,email_1,land_phone,cellphone,geolocation,gender,birth_year,birth_month,occupation_code,ethnic_code,home_value,home_ownership,home_square_footage,home_dwelling_type_code,income_description,credit_capacity_description,marital_status_code,number_children_code,children_present_flag,email_2,email_3,email_4,email_5
Contained lines: 236,000,000; 40.2 GB (10 GB compressed)
Sample: https://pastebin.com/piXfj9m9

[Hidden Content]

What I do not understand is why you removed most of the fields. You should have left it to the user to decide which ones to keep or delete. To me, this is not really helpful.
#16
Hi @ThinkingOne,

Would you by any chance happen to have any PDF data dictionaries handy for recent Acuity/Experian/Acxiom?

I only have a single old Acuity data dictionary, but it is very clear from it that the data fields in this data set released
by @pompompurin are a rather good match with Acuity, as a majority (but not a complete set) of all Acuity fields
can be readily matched and cross-walked.

It is also very clear that somebody has put a lot of time and effort into enhancing this data set.
Additional health, vehicle and housing data fields are present which I haven't seen before in
any similar big files like this...

Somebody has also taken a lot of time and care with the geocoding.
I can see that overall the geocoding is awesome and must have been done fairly recently
simply by putting the XY points on top of sat images and looking for new buildings.

All very, very impressive...

P.S. Many thanks @andyleung0927, very much appreciate your cleanup effort and upload here.
Very helpful.
#17
Bro, I removed some fields whose values have no clear meaning. The original CSV files also have different headers, so it was hard to keep everything in one file. In addition, I listed the lines clearly because I want to share this with people who really want it. If that is not really helpful, you can unlock the link. Thanks.

(April 27, 2021 at 01:09 AM)Ecopirate Wrote: Hi @ThinkingOne,

Would you by any chance happen to have any PDF data dictionaries handy for recent Acuity/Experian/Acxiom?

I only have a single old Acuity data dictionary, but it is very clear from it that the data fields in this data set released
by @pompompurin are a rather good match with Acuity, as a majority (but not a complete set) of all Acuity fields
can be readily matched and cross-walked.

It is also very clear that somebody has put a lot of time and effort into enhancing this data set.
Additional health, vehicle and housing data fields are present which I haven't seen before in
any similar big files like this...

Somebody has also taken a lot of time and care with the geocoding.
I can see that overall the geocoding is awesome and must have been done fairly recently
simply by putting the XY points on top of sat images and looking for new buildings.

All very, very impressive... 

P.S. Many thanks @andyleung0927, very much appreciate your cleanup effort and upload here.
Very helpful.
Thanks for your reply. I'm trying to figure out the meaning of each field. If someone has the dictionary, that would be great: with it, I could include more fields in the one TXT file and upload a more valuable edition.

(April 26, 2021 at 03:46 PM)pompompurin Wrote: Quick message to mods: Don’t delete this for repost please, it’s reformatted & I was waiting for somebody to process the data like this, plus he deserves the credits for his work. Thanks 😁
Thank you for your understanding and encouragement.
#18
Thanks to both andyleung0927 and pompompurin.
#19
I'm taking these.
#20
(April 26, 2021 at 04:22 PM)pompompurin Wrote: This data is AMAZINGLY neat. I tried the trick where you check for the [email protected], and there were no results. I also tried the trick where you grab all the email domains and sort by the number of occurrences, and there are NO emails with “test” in them, or ANY type of spam/testing domains. I’m going to do the grueling task of identifying the few thousand domains used for emails, and hopefully find some domains that might be the source (assuming the people who owned this system tested it with their company emails). They must’ve been whitelisting email domains & triple checking that emails exist before allowing people to continue. I am almost certain that this data is from some type of verifications.io//experian//other data on people.

Usually the first few lines tell you which company the data is from. Check the first few emails and look at the domain.

OP, thanks for putting the data neatly together. I need it badly, and I was manually merging 50 files at a time lol.

Hey, can you use gofile for this please?

Mega has a 3 GB limit per day, so it'll take 3 days to download this :(
#21
@redXXX, so can you write here what kind of company owned this data?
#22
(April 27, 2021 at 09:27 AM)ForumRAID Wrote: @redXXX, so can you write here what kind of company owned this data?

I looked at the first few lines, but there were no emails :(
#23
(April 27, 2021 at 08:24 AM)redXXX Wrote:
(April 26, 2021 at 04:22 PM)pompompurin Wrote: This data is AMAZINGLY neat. I tried the trick where you check for the [email protected], and there were no results. I also tried the trick where you grab all the email domains and sort by the number of occurrences, and there are NO emails with “test” in them, or ANY type of spam/testing domains. I’m going to do the grueling task of identifying the few thousand domains used for emails, and hopefully find some domains that might be the source (assuming the people who owned this system tested it with their company emails). They must’ve been whitelisting email domains & triple checking that emails exist before allowing people to continue. I am almost certain that this data is from some type of verifications.io//experian//other data on people.

Usually the first few lines tell you which company the data is from. Check the first few emails and look at the domain.

OP, thanks for putting the data neatly together. I need it badly, and I was manually merging 50 files at a time lol.

Hey, can you use gofile for this please?

Mega has a 3 GB limit per day, so it'll take 3 days to download this :(
1. I am uploading the file to gofile.io, but I need more time.
2. The file order was shuffled during my merge, so you may not find the "first few lines".
#24
andyleung0927, can you upload the script itself (or the full algorithm) so that anyone can make their own version of this database?